1 Introduction

What are we trying to study ?

The Enzyme Commission (EC) system is a widely accepted classification system used to categorize enzymes based on their catalytic activities. Enzymes play crucial roles in biological processes by accelerating chemical reactions and facilitating various metabolic pathways within living organisms. The EC system assigns a unique EC number to each enzyme, which provides valuable information about its function and specificity.

EC1 represents the class of enzymes known as oxidoreductases. These enzymes catalyze oxidation-reduction reactions, which involve the transfer of electrons between molecules. Oxidoreductases are involved in a wide range of biological processes, including energy production, biosynthesis, and detoxification. Examples of oxidoreductases include dehydrogenases, oxidases, reductases, and peroxidases.

The Enzyme Commission (EC) system provides a systematic and standardized approach for classifying enzymes based on their catalytic activities. EC1 represents the class of oxidoreductases, which participate in oxidation-reduction reactions, while the second digit in the EC number provides more specific information about the enzyme’s function. This classification system facilitates the study of enzymes and helps researchers gain insights into their structure, function, and potential applications in various fields, including medicine, biotechnology, and industrial processes.

- The internet

2 Importing the relevant libraries and dataset 🛠️

First, we import the required libraries which we will use to perform the current analysis.

library(tidyverse)
library(naniar)
library(bookdown)
library(stringr)
library(stringi)
library(lubridate)
library(DT)
library(forcats)
library(ggthemes)
library(corrplot)
library(mltools)
library(data.table)
library(visdat)
library(janitor)
library(cowplot)
library(caTools)
library(pscl)
library(ROCR)
library(caret)
library(xgboost)
library(randomForest)
library(lightgbm)
library(Matrix)
library(catboost)
library(magrittr)
library(fmsb)

Great ! We have all the libraries loaded. Next, we are gonna load the required dataset for conducting the enzyme classification analysis.

We will use one dataset for the purpose of exploratory data analysis and training the classification model while the test dataset for testing the classification model on a completely new dataset.

After reading the data, let us see how the train dataset looks like.

df_train <- read_csv("data/train.csv")
df_test <-  read_csv("data/test.csv")
head(df_train)
## # A tibble: 6 × 38
##      id BertzCT  Chi1 Chi1n Chi1v Chi2n Chi2v Chi3v Chi4n EState_VSA1
##   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
## 1     0    323.  9.88  5.88  5.88  4.30  4.30  2.75 1.75          0  
## 2     1    274.  7.26  4.44  5.83  3.29  4.49  2.20 1.29         45.1
## 3     2    522. 10.9   8.53 11.1   6.67  9.52  5.82 1.77         15.6
## 4     3    567. 12.5   7.09 12.8   6.48 11.0   7.91 3.07         95.6
## 5     4    113.  4.41  2.87  2.87  1.88  1.88  1.04 0.728        18.0
## 6     5    145.  5.54  3.48  3.48  2.48  2.48  1.51 0.671        36.9
## # ℹ 28 more variables: EState_VSA2 <dbl>, ExactMolWt <dbl>,
## #   FpDensityMorgan1 <dbl>, FpDensityMorgan2 <dbl>, FpDensityMorgan3 <dbl>,
## #   HallKierAlpha <dbl>, HeavyAtomMolWt <dbl>, Kappa3 <dbl>,
## #   MaxAbsEStateIndex <dbl>, MinEStateIndex <dbl>, NumHeteroatoms <dbl>,
## #   PEOE_VSA10 <dbl>, PEOE_VSA14 <dbl>, PEOE_VSA6 <dbl>, PEOE_VSA7 <dbl>,
## #   PEOE_VSA8 <dbl>, SMR_VSA10 <dbl>, SMR_VSA5 <dbl>, SlogP_VSA3 <dbl>,
## #   VSA_EState9 <dbl>, fr_COO <dbl>, fr_COO2 <dbl>, EC1 <dbl>, EC2 <dbl>, …

We can observe that there are multiple process parameters present in the dataset which can help us analyse and predict the values of EC1 and EC2. But what do all these variables tell us ? Based on the information from Panmie’s Kaggle notebook, following are the explanations of each of these variables.

  1. Id: This feature represents the identifier or unique identification number of a molecule. It serves as a reference but doesn’t directly contribute to the predictive model.

  2. BertzCT: This feature corresponds to the Bertz complexity index, which measures the structural complexity of a molecule. It can provide insights into the intricacy of molecular structures.

  3. Chi1 : The Chi1 feature denotes the 1st order molecular connectivity index, which describes the topological connectivity of atoms in a molecule. It characterizes the atomic bonding pattern within the molecule.

  4. Chi1n : This feature is the normalized version of the Chi1 index. It allows for standardized comparisons of the 1st order molecular connectivity across different molecules.

  5. Chi1v : The Chi1v feature represents the 1st order molecular variance connectivity index. It captures the variance or diversity in the connectivity of atoms within a molecule.

  6. Chi2n : The Chi2n feature indicates the 2nd order molecular connectivity index, which provides information about the extended connectivity of atoms in a molecule. It considers the neighboring atoms of each atom in the molecule.

  7. Chi2v : Similar to Chi2n, the Chi2v feature measures the variance or diversity in the extended connectivity of atoms within a molecule at the 2nd order level.

  8. Chi3v : The Chi3v feature represents the 3rd order molecular variance connectivity index. It captures the variance in the 3rd order connectivity patterns among atoms in a molecule.

  9. Chi4n : This feature corresponds to the 4th order molecular connectivity index, which provides information about the extended connectivity of atoms in a molecule. The Chi4n index is normalized to allow for consistent comparisons across molecules.

  10. EState_VSA1 : EState_VSA1 is a feature that relates to the electrotopological state of a molecule. Specifically, it represents the Van der Waals surface area contribution for a specific atom type, contributing to the overall electrotopological state.

  11. EState_VSA2 : Similar to EState_VSA1, EState_VSA2 also represents the electrotopological state but for a different specific atom type.

  12. ExactMolWt : This feature denotes the exact molecular weight of a molecule. It provides an accurate measurement of the mass of the molecule.

  13. FpDensityMorgan1 : FpDensityMorgan1 represents the Morgan fingerprint density for a specific radius of 1. Morgan fingerprints are a method for generating molecular fingerprints, and this feature captures the density of those fingerprints.

  14. FpDensityMorgan2 : Similar to FpDensityMorgan1, this feature represents the Morgan fingerprint density for a specific radius of 2.

  15. FpDensityMorgan3 : FpDensityMorgan3 corresponds to the Morgan fingerprint density for a specific radius of 3.

  16. HallkierAlpha : The HallkierAlpha feature denotes the Hall-Kier alpha value for a molecule. It is a measure of molecular shape and can provide insights into the overall structure of the molecule.

  17. HeavyAtomMolWt : This feature represents the molecular weight of heavy atoms only, excluding hydrogen atoms. It focuses on the mass of non-hydrogen atoms within the molecule.

  18. Kappa3 : The Kappa3 feature corresponds to the Hall-Kier Kappa3 value, which is a molecular shape descriptor. It provides information about the shape and spatial arrangement of atoms within the molecule.

  19. MaxAbsEStateIndex : This feature represents the maximum absolute value of the E-state index. The E-state index relates to the electronic properties of a molecule, and its maximum absolute value can indicate the presence of specific electronic characteristics.

  20. MinEStateIndex : MinEStateIndex denotes the minimum value of the E-state index. It provides information about the lowest observed electronic property value within the molecule.

  21. NumHeteroatoms : This feature indicates the number of heteroatoms present in a molecule. Heteroatoms are atoms other than carbon and hydrogen, such as oxygen, nitrogen, sulfur, etc. This feature provides insights into the diversity and composition of atoms within the molecule.

  22. PEOE_VSA10 : PEOE_VSA10 represents the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type. It captures the surface area contribution of a particular atom type to the overall electrostatic properties.

  23. PEOE_VSA14 : Similar to PEOE_VSA10, PEOE_VSA14 also represents the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type.

  24. PEOE_VSA6 : This feature corresponds to the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type at a different level.

  25. PEOE_VSA7 : Similar to PEOE_VSA6, PEOE_VSA7 represents the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type.

  26. PEOE_VSA8 : PEOE_VSA8 denotes the partial equalization of orbital electronegativity Van der Waals surface area contribution for a specific atom type.

  27. SMR_VSA10 : SMR_VSA10 represents the solvent-accessible surface area Van der Waals surface area contribution for a specific atom type. It captures the contribution of a specific atom type to the solvent-accessible surface area.

  28. SMR_VSA5 : Similar to SMR_VSA10, this feature denotes the solvent-accessible surface area Van der Waals surface area contribution for a specific atom type at a different level.

  29. SlogP_VSA3 : The SlogP_VSA3 feature represents the LogP-based surface area contribution. It captures the contribution of a specific atom type to the surface area based on its logarithmic partition coefficient.

  30. VSA_EState9 : This feature denotes the E-state fragment contribution for the Van der Waals surface area calculation. It captures the fragment-specific contribution to the electrostatic properties of the molecule.

  31. fr_COO : The fr_COO feature represents the number of carboxyl (COO) functional groups present in the molecule. It ranges from 0 to 8, providing insights into the presence and abundance of carboxyl groups.

  32. fr_COO2 : Similar to fr_COO, fr_COO2 represents the number of carboxyl (COO) functional groups, ranging from 0 to 8.

  33. EC1 : EC1 is a binary feature representing a predicted label related to Oxidoreductases. It serves as one of the target variables for prediction.

  34. EC2 : EC2 is another binary feature representing a predicted label related to Transferases. It serves as another target variable for prediction.

3 Data cleaning

3.1 Removal of unnecessary variables

In the first section, we will try to remove all the variables that will not be required for our analysis.

df_train <- df_train %>% select(-c("id","EC3","EC4","EC5","EC6"))

3.2 Check for null values

In this step, we will try to check for the presence of null values in the dataset.

gg_miss_var(df_train)
Missingness in the dataset

Figure 3.1: Missingness in the dataset

Based on the figure 3.1, we can observe that

✅ The dataset does not contain any missing values. This indicates that we have a clean dataset which is ready for EDA and further analysis.

4 Exploratory Data Analysis

We can observe that there are a total of 32 variables in the current dataset !!! These are a lot more than what we would ideally like to analyse. Such types of datasets require a special kind of analysis called as High Dimensional Data Analysis which concentrate majorly on techniques such as clustering and pricipal component analysis to reduce the number of variables without completely losing data. While this is the right way to go about it, this notebook will however study the correlation of each variable and try to reduce the number of variables which are observed to show high multi-collinearity.

4.1 Correlation plot

Let us understand how each of these variables correlate.

corrplot(cor(df_train),        # Correlation matrix
         method = "number", # Correlation plot method
         type = "full",    # Correlation plot style (also "upper" and "lower")
         diag = TRUE,      # If TRUE (default), adds the diagonal
         tl.col = "black", # Labels color
         bg = "white",     # Background color
         title = "Correlation plot",       # Main title
         col = NULL,
         number.cex = 0.4,
         tl.cex = .5)
Correlation plot

Figure 4.1: Correlation plot

As we can observe from figure 4.1,

None of the variables have an unusually high correlation with EC1 or EC2 . However, we do observe multiple variables which have high correlation to each other. This pheonmenon is called multi-collinearity . Let us set a correlation threshold of 75%. Any variables with correlation values higher than this will be dropped from the dataset.

df_corr = cor(df_train)
hc = findCorrelation(df_corr, cutoff=0.75) # Removing variables with greater than 75% correlation
hc = sort(hc)
df_train_new = df_train[,-c(hc)]

Now that we have removed the variables that observed to show multi-collinearity, let us now see how the revised dataset looks like.

head(df_train_new) %>%
DT::datatable(width = 500, height = 500, options = list(pageLength = 6))

Let us now create the correlation plot of the revised dataset.

corrplot(cor(df_train_new),        # Correlation matrix
         method = "number", # Correlation plot method
         type = "full",    # Correlation plot style (also "upper" and "lower")
         diag = TRUE,      # If TRUE (default), adds the diagonal
         tl.col = "black", # Labels color
         bg = "white",     # Background color
         title = " ",       # Main title
         col = NULL,
         number.cex = 0.6,
         tl.cex = .5)
Correlation plot of revised dataset

Figure 4.2: Correlation plot of revised dataset

Figure 4.2 depicts the correlation values of all the variables which do not observe to demonstrate multi-collinearity.

4.2 Univariate analysis

Now that we have figured out the variables of interest, we will perform a univariate analysis of the revised dataset. One of the best ways to study the overall distribution of the variables is through a faceted histogram. Let us dive deeper.

pl1 <- ggplot(data = gather(df_train_new), aes(x = value,fill = factor(key))) + geom_histogram() + facet_wrap(~key,scales ="free_x") + theme_classic() + ggtitle("Univariate analysis of variables") + theme(legend.position = "none",plot.title = element_text(hjust = 0.5)) + labs(y="Number of instances",x = "Value of variable")
pl1
Univariate analysis of variables

Figure 4.3: Univariate analysis of variables

Based on our analysis of figure 4.3, we can observe that

while most variables range over a large scale in the X-axis, certain variables, namely FpDensityMorgan1 and Kappa3 range over a very small scale on the X-axis. This indicates that there is a large scale difference among the various variables. Hence, the dataset could benefit from standardisation technique at a later point of the analysis.

4.3 Multi-variate analysis

Now that we have performed a univariate analysis, it is now time to perform a multi-variate analysis to understand our variables better.

4.4 FpDensityMorgan1 and Kappa3

Let us observe how do these variables differ for each values of EC.

pl2 <- ggplot(data = df_train_new %>% filter(FpDensityMorgan1 > -5), aes(x =FpDensityMorgan1, y=Kappa3)) + geom_point(aes(color = factor(EC1))) + theme_classic() + labs(color = "EC1 indicator")
pl2
Kappa3 Vs FpDensityMorgan1 for EC1

Figure 4.4: Kappa3 Vs FpDensityMorgan1 for EC1

pl2 <- ggplot(data = df_train_new %>% filter(FpDensityMorgan1 > -5), aes(x =FpDensityMorgan1, y=Kappa3)) + geom_point(aes(color = factor(EC2))) + theme_classic() + labs(color = "EC2 indicator")
pl2
Kappa3 Vs FpDensityMorgan1 for EC1

Figure 4.5: Kappa3 Vs FpDensityMorgan1 for EC1

Based on the analysis from figures 4.4 and 4.5, we can observe that

💡Most values of Kappa3 lie around 0 for both EC1 and EC2. There are very few datapoints which are higher than 0. Hence, majority of the datapoints can be clustered into one group.💡

4.5 EState_VSA2 and PEOE_VSA8

These variables were observed to have a fair amount of correlation to each other. Let us try to visualise these parameters.

pl3 <- ggplot(data = df_train_new, aes(x = EState_VSA2, y= PEOE_VSA8,color = factor(EC1))) + geom_point() + theme_classic() + labs(color = "EC1 indicator")
pl3
EState_VSA2 Vs PEOE_VSA8 for EC1

Figure 4.6: EState_VSA2 Vs PEOE_VSA8 for EC1

pl4 <- ggplot(data = df_train_new, aes(x = EState_VSA2, y= PEOE_VSA8,color = factor(EC2))) + geom_point() + theme_classic() + labs(color = "EC2 indicator")
pl4
EState_VSA2 Vs PEOE_VSA8 for EC2

Figure 4.7: EState_VSA2 Vs PEOE_VSA8 for EC2

Based on figures 4.6 and 4.7, we can observe that

💡there is no strong relationship for the two variables in each of the indicators.💡

While we have reduced the number of variables for analysis, we still have a sizable number of variables to analyse. Let us instead plot the variable importance and choose the top 5 variables to study.

4.6 Feature importance

To study feature importances, let us use the XGBoost algorithm.

set.seed(101)
df_train_new_EC1 <- df_train_new %>% select(-EC2)
df_train_new_EC2 <- df_train_new %>% select(-EC1)

sample_EC1=sample.split(df_train_new_EC1$EC1,SplitRatio=0.7)
train_EC1=subset(df_train_new_EC1,sample_EC1==T)
test_EC1=subset(df_train_new_EC1,sample_EC1==F)

sample_EC2=sample.split(df_train_new_EC2$EC2,SplitRatio=0.7)
train_EC2=subset(df_train_new_EC2,sample_EC2==T)
test_EC2=subset(df_train_new_EC2,sample_EC2==F)

🥳 Through the above code-chunk, we have successfully created separate train and test datasets for both the EC1 and EC2 indicators. The dataset is now ready for further processing.

xgb_model_EC1 <- xgboost(data = as.matrix(train_EC1 %>% select(-EC1)), label = as.matrix(train_EC1$EC1), 
                     max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1]  train-logloss:0.604235 
## [2]  train-logloss:0.592298
xgb_model_EC2 <- xgboost(data = as.matrix(train_EC2 %>% select(-EC2)), label = as.matrix(train_EC2$EC2), 
                     max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1]  train-logloss:0.499218 
## [2]  train-logloss:0.494251
xgb_EC1_importance <- xgb.importance(colnames(train_EC1 %>% select(-EC1)), model = xgb_model_EC1, 
               data = as.matrix(train_EC1 %>% select(-EC1)), label = train_EC1$EC1)

xgb_EC2_importance <- xgb.importance(colnames(train_EC2 %>% select(-EC2)), model = xgb_model_EC2, 
               data = as.matrix(train_EC2 %>% select(-EC2)), label = train_EC2$EC2)
pl5 <- ggplot(data =xgb_EC1_importance,aes(x = reorder(Feature,-Gain),y = round(Gain,2),fill=Feature)) + geom_col(color='black') + geom_label(aes(label = round(Gain,2))) + theme_classic() + ggtitle("Top 5 feature importances for EC1 \n using XGBoost") + labs(x = "Feature name",y = "Feature importance") + theme(plot.title = element_text(hjust = 0.5),legend.position = 'none',axis.title = element_text(face = 'bold'))
pl5
Feature importances for EC1 
 using XGBoost

Figure 4.8: Feature importances for EC1 using XGBoost

pl6 <- ggplot(data =xgb_EC2_importance,aes(x = reorder(Feature,-Gain),y = round(Gain,2),fill=Feature)) + geom_col(color='black') + geom_label(aes(label = round(Gain,2))) + theme_classic() + ggtitle("Top 5 feature importances for EC2 \n using XGBoost") + labs(x = "Feature name",y = "Feature importance") + theme(plot.title = element_text(hjust = 0.5),legend.position = 'none',axis.title = element_text(face = 'bold')) + scale_fill_brewer(palette = 'Accent')
pl6
Feature importances for EC2 
 using XGBoost

Figure 4.9: Feature importances for EC2 using XGBoost

Based on figures 4.8 and 4.9,

💡 we have obtained the top 5 features using the XGBoost model for each of the EC1 and EC2 indicators. This will help us concentrate our EDA efforts only on the most important features.💡

df_train_new_EC1_scaled <- as.data.frame(df_train_new_EC1 %>% filter(EC1 == 1) %>% scale(center=TRUE,scale=TRUE))
EC1_imp <- as.data.frame(df_train_new_EC1_scaled %>% select(-EC1) %>%  summarise_all(median,na.rm=TRUE))

EC1_imp <- rbind(rep(1,ncol(df_train_new_EC1_scaled -1)) , rep(-1,ncol(df_train_new_EC1_scaled -1)),EC1_imp)


df_train_new_EC2_scaled <- as.data.frame(df_train_new_EC2 %>% filter(EC2 == 1) %>% scale(center=TRUE,scale=TRUE))
EC2_imp <- as.data.frame(df_train_new_EC2_scaled %>% select(-EC2) %>%  summarise_all(median,na.rm=TRUE))

EC2_imp <- rbind(rep(1,ncol(df_train_new_EC2_scaled -1)) , rep(-1,ncol(df_train_new_EC2_scaled -1)),EC2_imp)
EC_imp <- EC1_imp
EC_imp <- rbind(EC1_imp,EC2_imp[3,])
rownames(EC_imp) <- 1:nrow(EC_imp)


areas <- c(rgb(1, 0, 0, 0.3),
           rgb(0, 1, 0, 0.3))

radarchart(EC_imp,
           axistype = 1,
           cglcol="gray", cglty=1, axislabcol="gray", caxislabels=seq(-1,1,0.5), cglwd=0.8,
           pcol = 2:4,      # Color for each line
           plwd = 4,        # Width for each line
           plty = 1,        # Line type for each line
           pfcol = areas,   # Color of the areas
           vlcex=0.5,      # Size of label
           title=paste("Median standaridised values for EC1 and EC2 indicators")
           
           )   

legend("topright",
       legend = paste(c("EC1","EC2")),
       bty = "n", pch = 20, col = areas,
       text.col = "grey25", pt.cex = 2)
Median standaridised values for EC1 and EC2 indicators

Figure 4.10: Median standaridised values for EC1 and EC2 indicators

After analysing 4.10, we observe that

💡 The median standardised values for both EC1 and EC2 indicators are very similar. The major difference however can be observed for the variable PEOE_VSA7. EC2 is observed to demonstrate a relatively lower scaled value when compared to EC1.💡

5 Classification model

5.1 Feature Engineering

As explained in 4.2, the dataset may benefit through a standardisation process. Let us standardise the variables to the same values as observed in figure 4.10.

df_train_new_EC1_scaled <- as.data.frame(df_train_new_EC1[,1:ncol(df_train_new_EC1) - 1] %>% scale(center = TRUE,scale = TRUE))
df_train_new_EC1_scaled <- cbind(df_train_new_EC1_scaled,EC1 = df_train_new_EC1$EC1)
df_train_new_EC2_scaled <- as.data.frame(df_train_new_EC2[,1:ncol(df_train_new_EC2) - 1] %>% scale(center = TRUE,scale = TRUE))
df_train_new_EC2_scaled <- cbind(df_train_new_EC2_scaled,EC2 = df_train_new_EC2$EC2)

5.2 Creating train and test dataset

set.seed(101)
sample_EC1=sample.split(df_train_new_EC1_scaled$EC1,SplitRatio=0.7)
train_EC1=subset(df_train_new_EC1_scaled,sample_EC1==T)
test_EC1=subset(df_train_new_EC1_scaled,sample_EC1==F)

sample_EC2=sample.split(df_train_new_EC2_scaled$EC2,SplitRatio=0.7)
train_EC2=subset(df_train_new_EC2_scaled,sample_EC2==T)
test_EC2=subset(df_train_new_EC2_scaled,sample_EC2==F)

5.3 Logistic Regression

We will first apply the logistic regression algorithm to perform classification for EC1 and EC2 indicators.

model_logit_EC1 <- glm(EC1~.,family=binomial(link='logit'),data=train_EC1)
pR2(model_logit_EC1)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -6.179552e+03 -6.603589e+03  8.480734e+02  6.421306e-02  7.840331e-02 
##          r2CU 
##  1.089552e-01
model_logit_EC2 <- glm(EC2~.,family=binomial(link='logit'),data=train_EC2)
pR2(model_logit_EC2)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -5.165338e+03 -5.212117e+03  9.355681e+01  8.974935e-03  8.967523e-03 
##          r2CU 
##  1.415614e-02
fitted.resultsEC1 <- predict(model_logit_EC1,newdata=subset(test_EC1,select=-(EC1)),type='response')
fitted.resultsEC1 <- ifelse(fitted.resultsEC1 > 0.5,1,0)

misClasificError <- mean(fitted.resultsEC1 != test_EC1$EC1)
print(paste('Accuracy of logistic regression:',1-misClasificError))
## [1] "Accuracy of logistic regression: 0.690406650190968"
fitted.resultsEC2 <- predict(model_logit_EC2,newdata=subset(test_EC2,select=-(EC2)),type='response')
fitted.resultsEC2 <- ifelse(fitted.resultsEC2 > 0.5,1,0)

misClasificError <- mean(fitted.resultsEC2 != test_EC2$EC2)
print(paste('Accuracy of logistic regression:',1-misClasificError))
## [1] "Accuracy of logistic regression: 0.79874213836478"
draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # create the matrix 
  rect(150, 430, 240, 370, col='#3F97D0')
  text(195, 435, 'False', cex=1.2)
  rect(250, 430, 340, 370, col='#F7AD50')
  text(295, 435, 'True', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#F7AD50')
  rect(250, 305, 340, 365, col='#3F97D0')
  text(140, 400, 'False', cex=1.2, srt=90)
  text(140, 335, 'True', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  

As we can observe,

💡 the logistic regression model was able to accurately predict approximately 70% of the cases with EC1 and EC2 indicators .💡

Let us further study the performance of the logistic regression model through the Receiver Operating Curve (ROC) metric.

p <- as.numeric(predict(model_logit_EC1, newdata=subset(test_EC1,select=-c(EC1)), type="response"))
q <- as.numeric(predict(model_logit_EC2, newdata=subset(test_EC2,select=-c(EC2)), type="response"))
pr <- prediction(p, test_EC1$EC1)
po <- prediction(q,test_EC2$EC2)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")

prf2 <- performance(po, measure = "tpr", x.measure = "fpr")
plot( prf, col = 'blue')
plot(prf2, add = TRUE, col = 'red')

legend("right", c("EC1", "EC2"), lty=1, 
    col = c("blue", "red"), bty="n", inset=c(0,-0.15))

title("Receiver Operating Curve for Logistic Regression")
Receiver Operating Curve for Logistic Regression

Figure 5.1: Receiver Operating Curve for Logistic Regression

Based on the analysis of the ROC plot in figure 5.1, we observe that,

💡 the logistic model is able to predict the cases for EC1 better than EC2 due to the curve covering higher area under the true positivity rate. 💡

cm_logit_EC1 <- confusionMatrix(factor(fitted.resultsEC1),factor(test_EC1$EC1))
draw_confusion_matrix(cm_logit_EC1)
Confusion matrix for Logistic Regression EC1

Figure 5.2: Confusion matrix for Logistic Regression EC1

cm_logit_EC2 <- confusionMatrix(factor(fitted.resultsEC2),factor(test_EC2$EC2))
draw_confusion_matrix(cm_logit_EC2)
Confusion matrix for Logistic Regression EC2

Figure 5.3: Confusion matrix for Logistic Regression EC2

5.4 XGboost

Let us try to use an extra gradient boosted ensemble method commonly termed as the XGboost classifier.

xgb_model_EC1 <- xgboost(data = as.matrix(train_EC1 %>% select(-EC1)), label = as.matrix(train_EC1$EC1), 
                     max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1]  train-logloss:0.604235 
## [2]  train-logloss:0.592298
xgb_model_EC2 <- xgboost(data = as.matrix(train_EC2 %>% select(-EC2)), label = as.matrix(train_EC2$EC2), 
                     max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
## [1]  train-logloss:0.499218 
## [2]  train-logloss:0.494251
pred_xgb_EC1 <- predict(xgb_model_EC1, as.matrix(test_EC1 %>% select(-c(EC1))))
pred_xgb_EC1 <- if_else(pred_xgb_EC1 > 0.5,1,0)


pred_xgb_EC2 <- predict(xgb_model_EC2, as.matrix(test_EC2 %>% select(-c(EC2))))
pred_xgb_EC2 <- if_else(pred_xgb_EC2 > 0.5,1,0)
cm_xgb_EC1 <- confusionMatrix(factor(pred_xgb_EC1),factor(test_EC1$EC1))
draw_confusion_matrix(cm_xgb_EC1)
Confusion matrix of the XGBoost model for EC1

Figure 5.4: Confusion matrix of the XGBoost model for EC1

cm_xgb_EC2 <- confusionMatrix(factor(pred_xgb_EC2),factor(test_EC2$EC2))
draw_confusion_matrix(cm_xgb_EC2)
Confusion matrix of the XGBoost model for EC2

Figure 5.5: Confusion matrix of the XGBoost model for EC2